Leaving these here as notes, but I feel comfortable with the decisions I’ve made.
Figure 1. Map showing the collection locations of
470 specimens available after basic initial filtering. Colors represent
“population cluster” as assigned by tree method (used
hclust and cutree; tree length = 100km).
Hover over the dots to see what cluster they were assigned
to!
See Shalene’s paper on page 998 for inspiration on how to deal with this…Jha et al. 2015 Contemporary human-altered landscapes and oceanic barriers reduce bumble bee gene flow
## chi^2 df Pr(chi^2 >) Pr.exact
## btern01 97.66759 36 1.343170e-07 0.000
## bt28 59.31202 10 4.889876e-09 0.000
## b96 105.11921 28 7.376899e-11 0.000
## bt30 113.92536 66 2.288810e-04 0.000
## btms0081 13.89628 3 3.049781e-03 0.003
## btms0066 421.14552 276 3.999226e-08 0.000
## btms0083 1010.34228 465 0.000000e+00 0.000
## b126 313.28115 120 0.000000e+00 0.000
## btms0062 1146.60599 561 0.000000e+00 0.000
## btern02 954.22326 300 0.000000e+00 0.000
## btms0086 50.77823 3 5.454315e-11 0.000
## bl13 58.50354 3 1.227130e-12 0.000
## btms0059 112.09174 36 9.687809e-10 0.001
There is evidence that, globally, loci are out of HWE. But this is due to differences between subpopulations. Shalene handles this nicely in the paper noted above, so it’s something to report…but not really anything to worry about downstream.
LD between loci - not likely an issue here (p.rD and p.Ia > 0.05; no consistent linkage in the pairwise comparison either).
## Ia p.Ia rbarD p.rD
## 0.054012341 0.053000000 0.004580062 0.053000000
Null allele frequences are within the range typical for pop gen analysis with microsatellites and unlikely to be an issue. Can cite https://www.nature.com/articles/6800545 and write it like they do in https://peerj.com/articles/13565/ where they say, “Nevertheless, the frequency of null alleles inferred with the methods of Brookfield and Chakraborty (0.09 and 0.13, respectively) is in line with values commonly reported in the literature and is unlikely to cause a major bias in downstream population structure analyses (Dakin & Avise, 2004).”
## btern01 bt28 b96 bt30 btms0081
## Observed frequency 0.07366541 0.07133358 0.09233988 0.09316057 0.06801660
## Median frequency 0.07226019 0.07150322 0.09097403 0.09317786 0.06711364
## 2.5th percentile 0.03601541 0.02964520 0.05142277 0.05409726 0.02201699
## 97.5th percentile 0.11281467 0.11534776 0.13633140 0.13493682 0.11317655
## btms0066 btms0083 b126 btms0062 btern02
## Observed frequency 0.06941596 0.13619032 0.09291617 0.12542148 0.1480633
## Median frequency 0.06895868 0.13399140 0.09147973 0.12406915 0.1455649
## 2.5th percentile 0.03505703 0.09892215 0.05678336 0.08843347 0.1080696
## 97.5th percentile 0.10267418 0.17449799 0.13111724 0.16409324 0.1863372
## btms0086 bl13 btms0059
## Observed frequency 0.05457863 0.08153726 0.06513655
## Median frequency 0.05238583 0.07891641 0.06337953
## 2.5th percentile 0.01546550 0.03615517 0.02414679
## 97.5th percentile 0.09448964 0.12691457 0.10382729
There are no loci with <80% completeness
## named numeric(0)
There are no individuals with poor data in the dataset…but that is because I also excluded them earlier in the process. So this is just a sanity check.
## named numeric(0)
There are no duplicates or clones in the dataset. It’s possible that there are prior to the COLONY filtering, so if re-running analysis keeping siblings in the dataset, double check.
## #############################
## # Number of Individuals: 231
## # Number of MLG: 231
## #############################
## [1] 231
Just a little sanity check that all the loci are polymorphic (they are)
## Mode TRUE
## logical 13
From the summary call to the genind object, we find the
number of specimens per cluster, the number of private alleles (which is
not very informative as it’s correlated with sample size), and the
rarefied allelic richness (which may be informative). It’s quite
debatable how well the rarefied richness does when dealing with such
uneven population sizes. This is a case where re-running with different
clusters might be more effective (i.e. splitting apart the big ones and
then dropping the small ones for this analysis)
Only displaying populations with 5 or more specimens.
Figure 2. Observed (blue) versus expected (grey) heterozygosity across the regions. Numbers above bars are the sample size for each region.
Same (except ALL clusters, not just those with >=5 specimens) as figure but in table format for investigating to your heart’s content. Added some math for a little helper.
SAME AS BELOW BUT LOOKING AT He BUT BE CAUTIOUS UNTIL RE-DOING THIS AT STRATA LEVEL modeling He with random effects structure NEED TO DOUBLE CHECK IS IS APPROPRIATE I assigned regional clusters here, and took the mean of the Ar across sites within the region…but it might be wrong to do that and might need to calculate that on the regions themselves at a higher level (i.e. by defining strata in genind object)
Would also be more interesting if there was an actual hypothesis here rather than just are the regions different…
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: he ~ long_m_center + lat_m_center + (1 | locus)
## Data: df_hs
##
## AIC BIC logLik deviance df.resid
## -230.0 -214.3 120.0 -240.0 164
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.3771 -0.4835 0.1428 0.4909 2.9513
##
## Random effects:
## Groups Name Variance Std.Dev.
## locus (Intercept) 0.04642 0.2155
## Residual 0.01034 0.1017
## Number of obs: 169, groups: locus, 13
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 3.684e-01 3.274e-01 1.125
## long_m_center 5.924e-08 4.465e-08 1.327
## lat_m_center 6.142e-08 6.547e-08 0.938
##
## Correlation of Fixed Effects:
## (Intr) lng_m_
## long_m_cntr -0.762
## lat_m_centr -0.982 0.757
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
## Data: df_hs
## Models:
## mod_null_he: he ~ 1 + (1 | locus)
## mod_he: he ~ long_m_center + lat_m_center + (1 | locus)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## mod_null_he 3 -232.19 -222.8 119.10 -238.19
## mod_he 5 -229.95 -214.3 119.98 -239.95 1.7602 2 0.4147
## R2m R2c
## [1,] 0.00191564 0.8181946
modeling allelic richness with random effects structure NEED TO DOUBLE CHECK IS IS APPROPRIATE I assigned regional clusters here, and took the mean of the Ar across sites within the region…but it might be wrong to do that and might need to calculate that on the regions themselves at a higher level (i.e. by defining strata in genind object)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: richness ~ long_m_center + (1 | locus)
## Data: df_ar
##
## AIC BIC logLik deviance df.resid
## 391.3 403.8 -191.7 383.3 165
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.9098 -0.5344 0.1968 0.5939 2.1752
##
## Random effects:
## Groups Name Variance Std.Dev.
## locus (Intercept) 1.8945 1.3764
## Residual 0.4124 0.6422
## Number of obs: 169, groups: locus, 13
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 3.755e+00 3.896e-01 9.638
## long_m_center 2.390e-07 1.844e-07 1.296
##
## Correlation of Fixed Effects:
## (Intr)
## long_m_cntr -0.154
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
## Data: df_ar
## Models:
## mod_null_ar: richness ~ 1 + (1 | locus)
## mod_ar: richness ~ long_m_center + (1 | locus)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## mod_null_ar 3 390.97 400.36 -192.49 384.97
## mod_ar 4 391.30 403.82 -191.65 383.30 1.6705 1 0.1962
## R2m R2c
## [1,] 0.00178405 0.8215358
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: mean_fis ~ region + (1 | locus)
## Data: df_fis
##
## AIC BIC logLik deviance df.resid
## 50.0 65.5 -20.0 40.0 160
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.70211 -0.62476 -0.07558 0.62004 2.78966
##
## Random effects:
## Groups Name Variance Std.Dev.
## locus (Intercept) 0.0000 0.0000
## Residual 0.0746 0.2731
## Number of obs: 165, groups: locus, 13
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 0.03863 0.07575 0.510
## regionCentral 0.19941 0.07922 2.517
## regionTwin Cities 0.14898 0.10713 1.391
##
## Correlation of Fixed Effects:
## (Intr) rgnCnt
## regionCntrl -0.956
## reginTwnCts -0.707 0.676
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
## Data: df_fis
## Models:
## mod_null_fis: mean_fis ~ 1 + (1 | locus)
## mod_fis: mean_fis ~ region + (1 | locus)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## mod_null_fis 3 52.371 61.689 -23.185 46.371
## mod_fis 5 49.980 65.510 -19.990 39.980 6.3902 2 0.04096 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## R2m R2c
## [1,] 0.03821111 0.03821111
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lmer(formula = mean_fis ~ region + (1 | locus), data = df_fis,
## REML = FALSE)
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## Central - Appalachian == 0 0.19941 0.07922 2.517 0.0301 *
## Twin Cities - Appalachian == 0 0.14898 0.10713 1.391 0.3358
## Twin Cities - Central == 0 -0.05044 0.07922 -0.637 0.7936
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: he ~ region + (1 | locus)
## Data: df_hs_reg
##
## AIC BIC logLik deviance df.resid
## -47.6 -39.3 28.8 -57.6 34
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.2034 -0.2685 0.1869 0.3937 1.4637
##
## Random effects:
## Groups Name Variance Std.Dev.
## locus (Intercept) 0.043783 0.2092
## Residual 0.004198 0.0648
## Number of obs: 39, groups: locus, 13
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 0.683800 0.060753 11.255
## regioncentral 0.004731 0.025415 0.186
## regionnorth west 0.016215 0.025415 0.638
##
## Correlation of Fixed Effects:
## (Intr) rgncnt
## regioncntrl -0.209
## regnnrthwst -0.209 0.500
## Data: df_hs_reg
## Models:
## mod_null_he_reg: he ~ 1 + (1 | locus)
## mod_he_reg: he ~ region + (1 | locus)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## mod_null_he_reg 3 -51.174 -46.183 28.587 -57.174
## mod_he_reg 5 -47.601 -39.283 28.801 -57.601 0.4271 2 0.8077
## R2m R2c
## [1,] 0.0009905828 0.9125857
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lmer(formula = he ~ region + (1 | locus), data = df_hs_reg, REML = FALSE)
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## central - appalachian == 0 0.004731 0.025415 0.186 0.981
## north west - appalachian == 0 0.016215 0.025415 0.638 0.799
## north west - central == 0 0.011485 0.025415 0.452 0.894
## (Adjusted p values reported -- single-step method)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: ar ~ region + (1 | locus)
## Data: df_ar_reg
##
## AIC BIC logLik deviance df.resid
## 184.1 192.4 -87.0 174.1 34
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8636 -0.4774 -0.1023 0.7252 1.5097
##
## Random effects:
## Groups Name Variance Std.Dev.
## locus (Intercept) 23.685 4.867
## Residual 1.346 1.160
## Number of obs: 39, groups: locus, 13
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 7.5385 1.3876 5.433
## regioncentral 0.9615 0.4551 2.113
## regionnorth west 1.0744 0.4551 2.361
##
## Correlation of Fixed Effects:
## (Intr) rgncnt
## regioncntrl -0.164
## regnnrthwst -0.164 0.500
## Data: df_ar_reg
## Models:
## mod_null_ar_reg: ar ~ 1 + (1 | locus)
## mod_ar_reg: ar ~ region + (1 | locus)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## mod_null_ar_reg 3 186.06 191.06 -90.033 180.06
## mod_ar_reg 5 184.08 192.40 -87.040 174.08 5.9861 2 0.05013 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## R2m R2c
## [1,] 0.009437976 0.9467198
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lmer(formula = ar ~ region + (1 | locus), data = df_ar_reg, REML = FALSE)
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## central - appalachian == 0 0.9615 0.4551 2.113 0.0874 .
## north west - appalachian == 0 1.0744 0.4551 2.361 0.0479 *
## north west - central == 0 0.1129 0.4551 0.248 0.9667
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: fis ~ region + (1 | locus)
## Data: df_fis_reg
##
## AIC BIC logLik deviance df.resid
## -52.3 -44.0 31.1 -62.3 34
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.7970 -0.5892 0.0631 0.4268 2.2589
##
## Random effects:
## Groups Name Variance Std.Dev.
## locus (Intercept) 0.003491 0.05909
## Residual 0.009202 0.09593
## Number of obs: 39, groups: locus, 13
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 0.03863 0.03125 1.236
## regioncentral 0.19694 0.03763 5.234
## regionnorth west 0.15480 0.03763 4.114
##
## Correlation of Fixed Effects:
## (Intr) rgncnt
## regioncntrl -0.602
## regnnrthwst -0.602 0.500
## Data: df_fis_reg
## Models:
## mod_null_fis_reg: fis ~ 1 + (1 | locus)
## mod_fis_reg: fis ~ region + (1 | locus)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## mod_null_fis_reg 3 -36.161 -31.17 21.080 -42.161
## mod_fis_reg 5 -52.288 -43.97 31.144 -62.288 20.127 2 4.26e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## R2m R2c
## [1,] 0.3669564 0.5410691
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lmer(formula = fis ~ region + (1 | locus), data = df_fis_reg,
## REML = FALSE)
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## central - appalachian == 0 0.19694 0.03763 5.234 < 1e-04 ***
## north west - appalachian == 0 0.15480 0.03763 4.114 0.000127 ***
## north west - central == 0 -0.04214 0.03763 -1.120 0.501742
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
This is more zoomed in than the above map for clarity. The top map is intended to actually show the historic range (states). Perhaps a better way would be to actually zoom in somehow rather than trim states.
Map of relative contribution of different clusters across regions. Pie size scaled by sample size (not linearly!). There is a dot in the middle of each pie, that is just so when I eventually edit this in inkscape (because there is no clear way to “repel” scatterpies) I have the centroid retained
Ultimately, I’m going to put the above barplot into the mapping area in the top-right corner. Then the order of the populations in that barplot will match the left-to-right order of the pies (after repelling them a bit for visual clarity)
Both DAPC and STRUCTURE methods are most explanatory at k=3; similar results are obtained for k=4, but choosing 3 for sake of parsimony
## #################################################
## # Discriminant Analysis of Principal Components #
## #################################################
## class: dapc
## $call: dapc.genind(x = genind_rpbb, n.pca = 60, n.da = 20)
##
## $n.pca: 60 first PCs of PCA used
## $n.da: 12 discriminant functions saved
## $var (proportion of conserved variance): 0.884
##
## $eig (eigenvalues): 43.34 28.59 20.37 15.31 13.33 ...
##
## vector length content
## 1 $eig 12 eigenvalues
## 2 $grp 231 prior group assignment
## 3 $prior 13 prior group probabilities
## 4 $assign 231 posterior group assignment
## 5 $pca.cent 182 centring vector of PCA
## 6 $pca.norm 182 scaling vector of PCA
## 7 $pca.eig 168 eigenvalues of PCA
##
## data.frame nrow ncol content
## 1 $tab 231 60 retained PCs of PCA
## 2 $means 13 60 group means
## 3 $loadings 60 12 loadings of variables
## 4 $ind.coord 231 12 coordinates of individuals (principal components)
## 5 $grp.coord 13 12 coordinates of groups
## 6 $posterior 231 13 posterior membership probabilities
## 7 $pca.loadings 182 60 PCA loadings of original variables
## 8 $var.contr 182 12 contribution of original variables
## $call
## ade4::amova(samples = xtab, distances = xdist, structures = xstruct)
##
## $results
## Df Sum Sq Mean Sq
## Between pop 12 224.7766 18.731387
## Between samples Within pop 218 2356.6080 10.810128
## Within samples 231 1688.2162 7.308295
## Total 461 4269.6008 9.261607
##
## $componentsofcovariance
## Sigma %
## Variations Between pop 0.2402062 2.583024
## Variations Between samples Within pop 1.7509166 18.828239
## Variations Within samples 7.3082952 78.588737
## Total variations 9.2994180 100.000000
##
## $statphi
## Phi
## Phi-samples-total 0.21411263
## Phi-samples-pop 0.19327472
## Phi-pop-total 0.02583024
## class: krandtest lightkrandtest
## Monte-Carlo tests
## Call: randtest.amova(xtest = amova_rpbb, nrepet = 999)
##
## Number of tests: 3
##
## Adjustment method for multiple comparisons: none
## Permutation number: 999
## Test Obs Std.Obs Alter Pvalue
## 1 Variations within samples 7.3082952 -22.22865 less 0.001
## 2 Variations between samples 1.7509166 16.24033 greater 0.001
## 3 Variations between pop 0.2402062 11.98792 greater 0.001
Pairwise Fst heat map. Not sure what’s up with the SE Minnesota population or Green Bay…but there’s also not that many specimens from them (8 and 7, respectively)
Below is with only populations having >=5 specimens
##
## Call:
## lm(formula = Fst ~ distance, data = df_fst_join)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.032784 -0.014806 -0.004775 0.007285 0.093444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.704e-02 4.725e-03 3.607 0.000686 ***
## distance 3.819e-08 9.292e-09 4.110 0.000138 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0225 on 53 degrees of freedom
## Multiple R-squared: 0.2417, Adjusted R-squared: 0.2274
## F-statistic: 16.89 on 1 and 53 DF, p-value: 0.0001381
##
## Call:
## lm(formula = Fst ~ log(distance), data = df_fst_join)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.032783 -0.014908 -0.005903 0.005882 0.088357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.178853 0.048422 -3.694 0.000524 ***
## log(distance) 0.016773 0.003846 4.361 5.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02217 on 53 degrees of freedom
## Multiple R-squared: 0.2641, Adjusted R-squared: 0.2502
## F-statistic: 19.02 on 1 and 53 DF, p-value: 5.994e-05
Determine numbers of males using the minimum homozygosity method, discuss implications
## # A tibble: 4 × 3
## sex is_het n
## <chr> <chr> <int>
## 1 female no 20
## 2 female yes 327
## 3 male no 52
## 4 male yes 63
## [1] 0.1615385
## # A tibble: 4 × 3
## sex is_het n
## <chr> <chr> <int>
## 1 female no 29
## 2 female yes 318
## 3 male no 89
## 4 male yes 26
## [1] 0.0755814
## # A tibble: 4 × 3
## site threshold prop_diploid
## <chr> <dbl> <dbl>
## 1 MN Zoo 1 0.175
## 2 MN Zoo 2 0.114
## 3 Appalachian 1 0.523
## 4 Appalachian 2 0.3
MNZoo 2020
## # A tibble: 3 × 3
## sex is_het n
## <chr> <chr> <int>
## 1 female yes 18
## 2 male no 8
## 3 male yes 1
## [1] 0.05263158
## # A tibble: 3 × 3
## sex is_het n
## <chr> <chr> <int>
## 1 female yes 18
## 2 male no 8
## 3 male yes 1
## [1] 0.05263158
mnzoo 2021
## # A tibble: 4 × 3
## sex is_het n
## <chr> <chr> <int>
## 1 female no 1
## 2 female yes 15
## 3 male no 3
## 4 male yes 6
## [1] 0.2857143
## # A tibble: 4 × 3
## sex is_het n
## <chr> <chr> <int>
## 1 female no 3
## 2 female yes 13
## 3 male no 6
## 4 male yes 3
## [1] 0.1875
Insert summary table, case studies of MnZoo, Turtle Valley, and other sites with >15? 10? whatever
MN Zoo 2020
## # A tibble: 1 × 6
## n_genotyped n_colonies ml.colony.num CI.lower CI.upper prop_detected
## <int> <int> <int> <dbl> <dbl> <dbl>
## 1 18 9 16 9 31 0.562
MN Zoo 2021
## # A tibble: 1 × 6
## n_genotyped n_colonies ml.colony.num CI.lower CI.upper prop_detected
## <int> <int> <int> <dbl> <dbl> <dbl>
## 1 16 13 39 19.5 1000 0.333
Turtle Valley 2021
## # A tibble: 1 × 6
## n_genotyped n_colonies ml.colony.num CI.lower CI.upper prop_detected
## <int> <int> <int> <dbl> <dbl> <dbl>
## 1 42 27 66 49.5 121. 0.409